We demonstrate a meaningful prospective power analysis for an (admittedly idealized) illustrative
connectome inference task. Modeling neurons as vertices and synapses as edges in a simple random
graph model, we optimize the trade-off between the number of (putative) edges identified and the
accuracy of the edge identification procedure. We conclude that explicit analysis of the
quantity/quality trade-off is imperative for optimal neuroscientific experimental design. In
particular, identifying edges faster/more cheaply, but with more error, can yield superior
inferential performance.

Introduction

Statistical inference on graphs begins with modeling graph-valued observations $G = (V, E)$,
where $V = \{1, \dots, n\}$ is the vertex set and $E \subset V \times V$ is the edge set
(connections between vertices), via a random graph model
$G \sim P_\theta \in \mathcal{P} = \{P_\theta : \theta \in \Theta\}$. The parameter $\theta$ governs
the distribution $P_\theta$ over the collection $\mathcal{G}_n$ of possible graphs on $n$ vertices,
and the parameter set $\Theta$ indexes the distributions in our model. Inference then proceeds via
estimation or hypothesis testing regarding the true but unknown parameter value $\theta_0 \in \Theta$.

Statistical inference on connectomes - graphs representing brain structure - involves positing
a probabilistic model for the connectome and deriving desirable properties of a statistic (a
function of $G$ or of an estimate $\hat{G}$) with respect to neuroscientific questions regarding
$\theta_0$.

For example, electron microscopy (EM) and magnetic resonance (MR) imaging technology 
can produce high-resolution connectome data. In EM, the connectome is the graph obtained by 
representing neurons as vertices and synapses as edges. In MR, the vertices represent voxels or 
neuroanatomical regions and the edges represent functional, effective, or structural connectivities. 
It is estimated that a human or primate cortical column contains approximately 100,000 neurons, 
each with about 10,000 connections, yielding approximately one billion synapses. Thus one 
could hope to observe a massive graph and perform inference thereon, yielding fundamentally 
important neuroscientific understanding. 

However, given the imaging data, one must estimate the graph. Typically, these brain-graphs are
obtained by tedious and time-consuming manual annotation. In Bock et al. 2011, approximately nine
expert-human months were required to find 250 synapses in EM imagery of the mouse primary visual
cortex - about 1 synapse per expert-human day. At that rate, it would take nearly 300 million
expert-human years to recover the full induced subgraph of a primate cortical column. It is
possible that annotating more quickly - and more errorfully - would yield superior statistical
inference. Indeed, regardless of the scale of the connectome, or the imaging technology, there is
an inherent quantity/quality trade-off in statistical connectomics.

In this manuscript, we present an (admittedly idealized) illustrative setting in which we
optimize the quantity/quality trade-off analytically, demonstrating that identifying brain-graph
edges faster/more cheaply, but with more error, can yield superior inference in statistical
connectomics. We describe a very simple brain-graph model and a correspondingly simple error model
to explicate how the quantity/quality trade-off impacts the power of a particular hypothesis test.



Connectomic Motivation 



The connections made by cortical brain cells are anatomically nanoscopic, yet each cell in the 
cortex has several centimeters of local anatomical "wiring" (Braitenberg and Schüz 1998). This
wiring packs the cortical volume essentially completely. Bock et al. recently characterized
the in vivo responses of a group of cells in mouse visual cortex, then imaged a volume of brain
containing the cells using a custom-built high throughput electron microscopy (EM) camera 
array. Each voxel in the resulting data set occupies about 4 x 4 x 45 cubic nanometers of 
brain; the 10 teravoxel volume spans 450 x 350 x 50 cubic micrometers. The imaged volume 
is of sufficient size and resolution that they were able to trace the local connectivity of the 
physiologically characterized cells. One can therefore record what cells in the brain are doing 
and then trace their connectivity - a combination which could enable a new level of understanding 
of cortical circuits to be achieved. 



Model & Hypotheses 

Let G be an independent edge stochastic block model on n vertices. 

Vertices represent neurons. We denote by $\mathcal{E}$ the collection of $n_E$ excitatory neurons
and by $\mathcal{I}$ the collection of $n_I$ inhibitory neurons. Thus the vertex set $V$ is
decomposed as the disjoint union $V = \mathcal{E} \cup \mathcal{I}$. Let
$n = |V| = |\mathcal{E}| + |\mathcal{I}| = n_E + n_I$ with $n_E = \lambda n$ and
$n_I = (1 - \lambda) n$ for some $\lambda \in (0, 1)$.

Edges represent synapses. We consider loopy graphs - a neuron may connect to itself. We 
consider the undirected edge case for simplicity; the directed and multigraph cases follow mutatis 
mutandis. 

The block model structure is such that

$$P[u \sim v] = p_{EE} \text{ for } u, v \in \mathcal{E},$$
$$P[u \sim v] = p_{II} \text{ for } u, v \in \mathcal{I},$$
$$P[u \sim v] = p_{EI} = p_{IE} \text{ otherwise.}$$

That is, the probability that two excitatory neurons connect to one another is given by random
graph model parameter $p_{EE}$, the probability that two inhibitory neurons connect to one another
is given by $p_{II}$, and the probability that an excitatory neuron connects to an inhibitory neuron
is given by $p_{EI} = p_{IE}$ (necessarily equal if and only if we are considering the undirected case).

We assume that the excitatory-excitatory connection rate $p_{EE}$ and the inhibitory-inhibitory
connection rate $p_{II}$ are equal. We propose to test the hypothesis that this common rate
$p_{EE} = p_{II}$ is equal to the excitatory-inhibitory connection rate $p_{EI}$:

$$H_0 : p_{EE} = p_{II} = p_{EI}$$
vs.
$$H_A : p_{EE} = p_{II} < p_{EI}.$$

The available data are an observed collection of putative or errorful edges.

Data 

Ideally, we would observe the entire induced subgraph $G = \Omega(V; G^*)$ for our imaged volume,
where $G^*$ is the entire brain graph (connectome). Practically, since identifying edges is
expensive, for large $n$ we will observe a subgraph - a subset of edges. For $i = 1, \dots, z$, we
define the random variable $X_i$ representing a perfect edge observation via the "tracing
algorithm" given by

(1) a neuron: choose a vertex $v_i$ uniformly at random from $V$.

(2) a synapse: choose an edge $v_i \sim \cdot$ uniformly at random from among edges incident to $v_i$.

(3) the post-synaptic neuron: identify vertex $w_i$ for $v_i \sim w_i$.

(4) the nature of the synapse: $X_i = I\{v_i, w_i \in \mathcal{E} \text{ or } v_i, w_i \in \mathcal{I}\}$.

• If $w_i$ is not in our imaged volume, or if the edge (axon-synapse-dendrite) goes outside our
imaged volume so that we cannot trace it, we try again (with the same $i$).

• If $v_i \sim w_i$ is by chance previously identified, we try again (with the same $i$).

However, the algorithm above requires perfect tracing. Unfortunately, perfect edge observations
are expensive, even for a subset of edges. Instead, with error probability $\varepsilon \in [0, 1]$
we fail to correctly identify $w_i$ and instead identify a randomly chosen vertex $\tilde{w}_i$ in
Step (3) above, yielding

(3') identify $\tilde{w}_i$ for $v_i \sim \tilde{w}_i$; with probability $(1 - \varepsilon)$, $\tilde{w}_i = w_i$, otherwise $\tilde{w}_i$ is random.

(4') $X_i = I\{v_i, \tilde{w}_i \in \mathcal{E} \text{ or } v_i, \tilde{w}_i \in \mathcal{I}\}$.

Thus $X_i = 1$ if the $i$th (putative) edge is either an excitatory-excitatory connection or an
inhibitory-inhibitory connection, and $X_i = 0$ if the $i$th (putative) edge is an
excitatory-inhibitory connection. The edge tracing algorithm generates $z$ such
$\varepsilon$-errorful edges, yielding an errorful subgraph observation model.
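
The errorful tracing steps above can be simulated on a small synthetic graph. The following is a minimal sketch, not the authors' implementation: the function names `sample_sbm` and `trace_edges` are hypothetical, the replacement vertex in Step (3') is assumed to be drawn uniformly from $V$, and the "outside the imaged volume" retry is not modeled because the whole simulated graph is treated as observed.

```python
import random

def sample_sbm(n_e, n_i, p_ee, p_ii, p_ei, rng):
    """Sample adjacency lists of an undirected, loopy two-block graph.

    Vertices 0..n_e-1 are excitatory; the remaining n_i are inhibitory.
    """
    n = n_e + n_i
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u, n):  # v == u allows self-loops
            same_e = u < n_e and v < n_e
            same_i = u >= n_e and v >= n_e
            p = p_ee if same_e else p_ii if same_i else p_ei
            if rng.random() < p:
                adj[u].append(v)
                if v != u:
                    adj[v].append(u)
    return adj

def trace_edges(adj, n_e, z, eps, rng):
    """Return z indicators X_i from the errorful tracing steps.

    Assumes the graph has at least z edges; otherwise the loop never ends.
    """
    n = len(adj)
    seen = set()
    xs = []
    while len(xs) < z:
        v = rng.randrange(n)              # step (1): uniform vertex
        if not adj[v]:
            continue                      # no incident edge: try again
        w = rng.choice(adj[v])            # steps (2)-(3): uniform incident edge
        edge = (min(v, w), max(v, w))
        if edge in seen:
            continue                      # previously identified: try again
        seen.add(edge)
        if rng.random() < eps:            # step (3'): mis-identify the endpoint
            w = rng.randrange(n)          # assumption: uniform over V
        xs.append(int((v < n_e) == (w < n_e)))  # step (4'): same-type indicator
    return xs

rng = random.Random(0)
adj = sample_sbm(90, 10, 0.1, 0.1, 0.2, rng)
xs = trace_edges(adj, 90, 100, 0.25, rng)
print(sum(xs) / len(xs))  # empirical estimate of P[X_i = 1]
```

The sample mean of the returned indicators is the test statistic used in the inference step; the graph size and rates in the usage lines are illustrative only.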

Trade-off 

Presumably, the expense of edge tracing is increasing in putative edge count $z$ for fixed edge
tracing error $\varepsilon$ and decreasing in $\varepsilon$ for fixed $z$. For fixed resources
(imaging resolution and/or manual or automatic edge tracing resources) the trade-off of interest is
$z$ vs. $\varepsilon$ - quantity vs. quality. The number of $\varepsilon$-errorful edges that will
be traced, $z = h(\varepsilon)$, is an increasing function of $\varepsilon$. We derive below the
optimal operating point of the quantity/quality trade-off in a particular connectome inference task.

Of course, committing more resources (higher-resolution imaging and/or additional edge
tracing resources) should yield larger $z$ for the same $\varepsilon$ or smaller $\varepsilon$ for
the same $z$; a prospective power cost/benefit analysis can be performed to aid in the decision
regarding commitment of resources. Still, the quantity/quality trade-off is an essential component
of any such cost/benefit analysis, since one would want to consider the optimal quantity/quality
operating point for each level of resource commitment.

Inference 

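We derive the expression for

$$P[X_i = 1] = p_X = p_X(n, \lambda, p_{EE}, p_{EI}, \varepsilon)$$
$$= (1 - \varepsilon)\left( \frac{\lambda\, p_{EE}\, n_E}{p_{EE}\, n_E + p_{EI}\, n_I} + \frac{(1 - \lambda)\, p_{II}\, n_I}{p_{II}\, n_I + p_{IE}\, n_E} \right) + \varepsilon\,(2\lambda^2 - 2\lambda + 1).$$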
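
The expression above is straightforward to evaluate numerically. The sketch below is one possible transcription; the function name `p_x` and its argument names are hypothetical, and the undirected case $p_{IE} = p_{EI}$ is assumed.

```python
def p_x(n, lam, p_ee, p_ii, p_ei, eps):
    """P[X_i = 1] for the errorful tracing model (undirected, so p_IE = p_EI)."""
    n_e = lam * n
    n_i = (1.0 - lam) * n
    # Correct tracing: chance that a traced edge joins two neurons of the same type,
    # starting from an excitatory vertex and from an inhibitory vertex respectively.
    from_e = p_ee * n_e / (p_ee * n_e + p_ei * n_i)
    from_i = p_ii * n_i / (p_ii * n_i + p_ei * n_e)
    correct = lam * from_e + (1.0 - lam) * from_i
    # Mis-identified endpoint: a uniformly random vertex matches the type of v_i
    # with probability lam^2 + (1 - lam)^2 = 2 lam^2 - 2 lam + 1.
    random_match = 2.0 * lam**2 - 2.0 * lam + 1.0
    return (1.0 - eps) * correct + eps * random_match

# When all three rates are equal, the result does not depend on eps.
print(p_x(1000, 0.9, 0.3, 0.3, 0.3, 0.0), p_x(1000, 0.9, 0.3, 0.3, 0.3, 0.7))
```

When the three connection rates coincide, the first term reduces to $\lambda^2 + (1 - \lambda)^2$, so the two printed values agree up to floating-point rounding.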

Note that under $H_0$ the value of $p_X(n, \lambda, p_{EE}, p_{EI}, \varepsilon)$ is independent of
the value of $p_{EE} = p_{EI}$, and that $p_X$ is smaller under the alternative hypothesis
$p_{EE} < p_{EI}$ than under the null. Since we have (approximately) independent random variables
$X_i \sim \mathrm{Bernoulli}(p_X)$, we reject for small values of the test statistic
$\bar{X}_z = \frac{1}{z} \sum_{i=1}^{z} X_i$ based on having observed $z$ errorful edges. Assuming
independent errors, this test is uniformly most powerful (UMP). Applying the central limit
theorem under both $H_0$ and $H_A$ yields a large $n$, large $z$ normal approximation for the power
of the level $\alpha$ test,

$$P[\bar{X}_z < c_\alpha \mid H_A] = \beta_{z,\varepsilon}(n, \lambda, p_{EE}, p_{EI}; \alpha) = \Phi\left( \frac{\sqrt{p_X^0 (1 - p_X^0)}\,\Phi^{-1}(\alpha) + \sqrt{z}\,(p_X^0 - p_X^A)}{\sqrt{p_X^A (1 - p_X^A)}} \right),$$

where $p_X^0$ and $p_X^A$ denote the value of the Bernoulli parameter
$p_X(n, \lambda, p_{EE}, p_{EI}, \varepsilon)$ given above under $H_0$ (whatever be the value of
$p_{EE} = p_{EI}$) and under $H_A$ (for specific values of $p_{EE} < p_{EI}$), respectively.

Perfect edge tracing ($\varepsilon = 0$) for $z$ edges yields power $\beta_{z,0} > \alpha$. Errorful
edge tracing ($\varepsilon > 0$) for $z$ putative edges yields power
$\beta_{z,\varepsilon} < \beta_{z,0}$. As expected, more error yields less power for fixed
putative edge count $z$: $\varepsilon_1 < \varepsilon_2$ implies
$\beta_{z,\varepsilon_1} > \beta_{z,\varepsilon_2}$, and $\beta_{z,1} = \alpha$ for any $z$.
Furthermore, more edges yields more power for fixed edge tracing error rate $\varepsilon$:
$z_1 > z_2$ implies $\beta_{z_1,\varepsilon} > \beta_{z_2,\varepsilon}$ for any $\varepsilon$.
However, we can identify equivalent sample size $z'_\varepsilon$ such that
$\beta_{z'_\varepsilon,\varepsilon} \approx \beta_{z,0}$. Thus, if errorful edge tracing is
sufficiently less expensive so that we can trace more than $z'_\varepsilon$ errorful edges compared
to just $z$ perfect edges (which is plausible since perfect edge tracing is expensive while errorful
edge tracing should be less so) then inferential performance based on an errorful subgraph will
be superior to inferential performance based on a perfect subgraph. This suggests that we may
benefit from optimizing the quantity/quality trade-off with respect to power for fixed resources.

Example 

A mouse cortical column (the existence of which is admittedly the subject of neuroscientific
debate; we proceed with an illustrative example regardless) has approximately 10,000 neurons.
With parameter values $n = 10000$, $\lambda = 0.9$, $p_{EE} = p_{II} = 0.1$, $p_{EI} = 0.2$ for the
random graph model $G$, we expect the graph to have roughly 6 million edges total in the induced
subgraph. High-accuracy manual edge tracing results in approximately one edge per expert per day.
For these parameter values, testing at level $\alpha = 0.05$ yields

$$\beta_{50,0} \approx 0.429,$$
$$\beta_{50,0.5} \approx 0.196,$$
$$\beta_{250,0.5} \approx 0.488.$$

Thus our prospective power analysis demonstrates that less expensive errorful edge tracing can
be inferentially superior to more expensive perfect edge tracing: if we can trace $z = 50$ edges
perfectly ($\varepsilon = 0$) we obtain power $\beta_{50,0} \approx 0.429$ (compared to degraded
power $\beta_{50,0.5} \approx 0.196$ with the same number of (putative) edges ($z = 50$) and 50%
edge tracing error ($\varepsilon = 0.5$)), while if we can trace $z = 250$ putative edges with 50%
edge tracing error ($\varepsilon = 0.5$) we obtain significantly improved power
$\beta_{250,0.5} \approx 0.488 > \beta_{50,0} \approx 0.429$. The equivalent sample size for this
example is $z'_{0.5} = 178$, so that $\beta_{178,0.5} \approx \beta_{50,0} \approx 0.429$; thus
tracing more than 178 50%-errorful putative edges yields higher power than that obtained with 50
errorless edges.
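
The three power values quoted in this example can be checked with a short script. This is a sketch under the normal approximation, assuming the standardized form $\Phi\big((\sqrt{p_X^0(1-p_X^0)}\,\Phi^{-1}(\alpha) + \sqrt{z}\,(p_X^0 - p_X^A)) / \sqrt{p_X^A(1-p_X^A)}\big)$; the function names and default arguments are hypothetical, and $n$ is omitted because it cancels when $n_E = \lambda n$ and $n_I = (1-\lambda) n$.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}

def p_x(lam, p_same, p_cross, eps):
    """P[X_i = 1] with p_same = p_EE = p_II and p_cross = p_EI (n cancels)."""
    from_e = p_same * lam / (p_same * lam + p_cross * (1 - lam))
    from_i = p_same * (1 - lam) / (p_same * (1 - lam) + p_cross * lam)
    correct = lam * from_e + (1 - lam) * from_i
    return (1 - eps) * correct + eps * (2 * lam**2 - 2 * lam + 1)

def power(z, eps, lam=0.9, p_same=0.1, p_cross=0.2, alpha=0.05):
    """Normal-approximation power of the level-alpha test from z putative edges."""
    p0 = 2 * lam**2 - 2 * lam + 1              # null value, free of eps
    pa = p_x(lam, p_same, p_cross, eps)        # value under the alternative
    num = sqrt(p0 * (1 - p0)) * STD_NORMAL.inv_cdf(alpha) + sqrt(z) * (p0 - pa)
    return STD_NORMAL.cdf(num / sqrt(pa * (1 - pa)))

for z, eps in [(50, 0.0), (50, 0.5), (250, 0.5)]:
    print(z, eps, power(z, eps))
```

Under these assumptions the printed values should land within about 0.001 of the three powers reported above; small differences in the last digit come from rounding.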

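Extending this example, we assume that $z = h(\varepsilon)$. That is, the number of (errorful)
putative edges that we can trace with edge tracing error $\varepsilon$ is given by some
(increasing) function $h$ of $\varepsilon$. Thus the power $\beta(\varepsilon)$ obtained when using
the edge tracing algorithm engineered to produce $z = h(\varepsilon)$ putative edges with edge
tracing error $\varepsilon$ is given by

$$\beta(\varepsilon) = \Phi(g(\varepsilon))$$

where

$$g(\varepsilon) = \frac{\sqrt{p_X^0 (1 - p_X^0)}\,\Phi^{-1}(\alpha) + h(\varepsilon)^{1/2}\,(p_X^0 - p_X^A(\varepsilon))}{\sqrt{p_X^A(\varepsilon)\,(1 - p_X^A(\varepsilon))}}.$$

Assuming that $h$ is differentiable with respect to $\varepsilon$ on $[0, 1)$, we obtain

$$\frac{d\beta}{d\varepsilon} = \phi(g(\varepsilon))\, g'(\varepsilon).$$

Then we evaluate the sign of $\frac{d\beta}{d\varepsilon}\big|_{\varepsilon = \varepsilon_0}$ at the
current edge tracing algorithm operating point $\varepsilon_0$;
$\frac{d\beta}{d\varepsilon}\big|_{\varepsilon = \varepsilon_0} > 0$ implies less expensive, more
errorful (larger $\varepsilon$) edge tracing (resulting in larger $z$) will yield increased power,
while $\frac{d\beta}{d\varepsilon}\big|_{\varepsilon = \varepsilon_0} < 0$ implies that inference
will improve with more accurate but more expensive edge tracing (resulting in fewer putative
edges). Finding $\varepsilon^*$ such that
$\frac{d\beta}{d\varepsilon}\big|_{\varepsilon = \varepsilon^*} = 0$ will (after checking
appropriate side conditions) yield optimal power $\beta^* = \beta(\varepsilon^*)$.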
To continue with our example, we consider for illustration

$$z = h(\varepsilon) = 50 + \frac{200}{\sin(\pi/4)} \sin(\varepsilon\pi/2),$$

designed to give $h(0) = 50$, $\beta(0) \approx 0.429$ and $h(1/2) = 250$, $\beta(1/2) \approx 0.488$
for consistency with our running example. This $h$ suggests that 50 expert days yields $z = 50$ at
$\varepsilon = 0$ and $z = 250$ at $\varepsilon = 0.5$; investigation into the precise character of
an appropriate $h$ will be necessary. For the specified $h$ in our example we calculate the optimal
operating point for the edge tracing algorithm, obtaining $\varepsilon^* \approx 0.247$ and
resulting in $h(\varepsilon^*) \approx 157$ and $\beta(\varepsilon^*) \approx 0.599$. Thus
optimizing the quantity/quality trade-off has yielded an improvement in power of almost 40%. We
should engineer our edge tracing to operate at error rate $\varepsilon^* \approx 0.247$.
A summary of this example is presented in Figure 1.
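
The optimal operating point for this illustrative $h$ can be approximated numerically. The sketch below swaps the derivative condition for a plain grid search over $\varepsilon \in [0, 1)$, which is an assumption of convenience rather than the procedure used in the text; all names are hypothetical, and the power formula is the normal approximation with $z = h(\varepsilon)$.

```python
from math import pi, sin, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
LAM, P_SAME, P_CROSS, ALPHA = 0.9, 0.1, 0.2, 0.05   # running example values
P0 = 2 * LAM**2 - 2 * LAM + 1                       # P[X_i = 1] under the null

def h(eps):
    """Illustrative quantity/quality function: putative edges traced at error eps."""
    return 50 + 200 / sin(pi / 4) * sin(eps * pi / 2)

def p_alt(eps):
    """P[X_i = 1] under the alternative at tracing error eps."""
    from_e = P_SAME * LAM / (P_SAME * LAM + P_CROSS * (1 - LAM))
    from_i = P_SAME * (1 - LAM) / (P_SAME * (1 - LAM) + P_CROSS * LAM)
    return (1 - eps) * (LAM * from_e + (1 - LAM) * from_i) + eps * P0

def beta(eps):
    """Normal-approximation power when z = h(eps) putative edges are traced."""
    pa = p_alt(eps)
    g = (sqrt(P0 * (1 - P0)) * STD_NORMAL.inv_cdf(ALPHA)
         + sqrt(h(eps)) * (P0 - pa)) / sqrt(pa * (1 - pa))
    return STD_NORMAL.cdf(g)

grid = [i / 1000 for i in range(1000)]   # eps in [0, 1) in steps of 0.001
eps_star = max(grid, key=beta)           # grid point with the highest power
print(eps_star, h(eps_star), beta(eps_star))
```

Because the power curve is fairly flat near its maximum, the grid maximizer may differ from the reported $\varepsilon^*$ in the third decimal place while giving essentially the same power.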

Discussion 



We conclude that we can indeed do a meaningful prospective power analysis for this connectome
inference task, and that analysis of the quantity/quality trade-off between error in edge tracing
and the number of putative edges traced is imperative for optimal neuroscientific experimental
design.

The significance of our "admittedly idealized" illustrative setting is that it is a simple version
of a general question of scientific interest: how does connectivity probability depend on the
neurons in question? Real scientific interest lies in more elaborate graph models and hypotheses -
$K > 2$ kinds of cells and $K^2$ connection probabilities, or even an unknown number of cell types.
The method described here can be generalized to these more realistic settings - some maintaining
analytic tractability, but many realistic complex generalizations will of course require us to
resort to numerical approximation methods. In hypothesis testing, one wishes to collect data in
such a way as to maximize the probability of rejecting the null hypothesis given that it is false.
Often, the data collector is limited to only experimental intuition in making the quantity/quality
decision. In some cases, however, one can turn to statistical connectomics to shed light on the
quantitative trade-offs one expects with regard to a particular statistical inference question.
Specifically, we have demonstrated that one can approximate the optimal operating point for
the (errorful) edge tracing algorithm.

The above example uses statistical connectomics to address an important decision in
neuroscientific data collection and analysis. While the results presented apply to a special case,
general lessons can be learned.

In particular, we see that explicitly modeling the quantity/quality trade-off can yield
significant inferential advantages. Note that the optimal operating point depends heavily on
specific model assumptions; thus, any conclusions from such a prospective analysis are subject to
the adequacy of those assumptions. We emphasize that this analysis is fundamentally a function of
the particular inference task. Although we have outlined analytical results for one specific (i)
inference task, (ii) graph model, (iii) error model, and (iv) quantity/quality trade-off function,
each of these components must be customized for the neuroscientific question at hand.

The example results presented herein depend on knowing the quantity/quality function
$z = h(\varepsilon)$. In general, this function will not be known, but it can be estimated.
Specifically, consider the scenario of manual annotation of EM data. The performance of trained
edge tracers operating so as to target various putative edge counts can be calibrated against a
"gold" standard - derived, perhaps, using independent, complementary imaging methods - providing
an estimate of $h$. As the size of connectome data sets continues to increase, the number of
manual annotators required to estimate massive connectomes becomes impractically large. Therefore,
we will rely on machine vision algorithms to annotate the data (cf. the Open Connectome
Project). The quantity/quality trade-off applies to such algorithms as surely as it applies to
manual annotation, and the quantity/quality function $h$ will need to be estimated.

The implications of optimizing the quantity/quality trade-off in connectome inference are
potentially substantial in light of the recent global investment in connectome science. For
example, the USA National Institutes of Health (NIH) has budgeted over $30 million for the Human
Connectome Project, which aims to collect and analyze human magnetic resonance (MR) connectomes.
Similarly, the European Union (EU) is potentially granting up to €1 billion to the Human Brain
Project. Understanding and exploiting the quantity/quality trade-off in connectome inference
will be essential to the efficient use of the available resources.

Figure 1: Power $\beta$ and its derivative as functions of the edge tracing error rate
$\varepsilon$ for our example scenario (see text for details). (We plot a rescaled
$\frac{d\beta}{d\varepsilon}(\varepsilon)$ so that the two curves are on approximately the
same scale and can productively be presented on the same plot.)